A new family of self-organizing maps, the Winner-Relaxing Kohonen
Algorithm, is introduced as a generalization of a variant given by
Kohonen in 1991. The magnification behaviour is calculated
analytically. For the original variant a magnification exponent of 4/7
is derived; the generalized version allows the magnification to be
steered over the wide range from exponent 1/2 to 1 in the
one-dimensional case, and thus provides optimal mapping in the sense of
information theory. The Winner Relaxing Algorithm requires minimal
extra computation per learning step and is easy to implement.

1 Introduction 

The self-organizing map (SOM) algorithm (Kohonen 1982) has served both
as a model for topology-preserving primary sensory processing in the
cortex (Obermayer et al. 1992) and for technical applications (Ritter et
al. 1992). Self-organizing feature maps map an input space, such as the
retina or skin receptor fields, into a neural layer by feedforward
structures with lateral inhibition. Defining properties are topology
preservation, error tolerance, plasticity, and self-organized formation
by a local process. Compared to other clustering algorithms and vector
quantizers, the apparent advantage of the SOM for data visualization and
exploration is its approximate topology preservation. In contrast to the
Elastic Net (Durbin & Willshaw 1987) and the Linsker (1989) algorithm,
which perform gradient descent in a certain energy landscape, the
Kohonen algorithm lacks an energy function in the general case of a
continuous input distribution. Although the learning process can be
described in terms of a Fokker-Planck equation (Ritter & Schulten 1988),
the expectation value of the learning step is a nonconservative force
(Obermayer et al. 1992) driving the process, so that it has no
associated energy function. Despite much research, the relationship of
the Kohonen model and its variants to general principles remains an open
field (Kohonen 1991).

1.1 Kohonen's Self Organizing Feature Map

Kohonen's Self Organizing Map is defined as follows: Every stimulus $v$
of an input space $V$ is mapped to a "center of excitation", or winner,

$$ s = \arg\min_{r \in R} |w_r - v|, \tag{1} $$

where $|\cdot|$ denotes the Euclidean distance in input space. In the
Kohonen model, the learning rule for each synaptic weight vector $w_r$
is given by

$$ \delta w_r = \eta \cdot g_{rs} \cdot (v - w_r), \tag{2} $$

where $g_{rs}$ defines the neighborhood relation in $R$, and will
throughout this paper be a Gaussian function of the Euclidean distance
$|r - s|$ in the neural layer. Topology preservation is enforced by the
common update of all weight vectors whose neuron $r$ is adjacent to the
center of excitation $s$; the adjacency function $g_{rs}$ prescribes the
topology in the neural layer. The learning rate $\eta$ is usually
decreased during the process.
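
The winner selection (1) and the learning rule (2) can be sketched in a
few lines. This is a minimal illustration, assuming scalar weights, a
one-dimensional chain of neurons and a Gaussian neighborhood of width
`gamma`; the function names are chosen here for illustration and do not
come from the original work.

```python
import math


def gaussian_neighborhood(r, s, gamma):
    """Gaussian g_rs of the index distance |r - s| in a 1D neural layer."""
    return math.exp(-((r - s) ** 2) / (2.0 * gamma ** 2))


def find_winner(weights, v):
    """Eq. (1): index of the weight closest to the stimulus v (scalar input)."""
    return min(range(len(weights)), key=lambda r: abs(weights[r] - v))


def som_step(weights, v, eta, gamma):
    """Eq. (2): move every weight towards v, scaled by g_rs; returns a new list."""
    s = find_winner(weights, v)
    return [w + eta * gaussian_neighborhood(r, s, gamma) * (v - w)
            for r, w in enumerate(weights)]
```

For example, `som_step([0.0, 0.5, 1.0], 0.6, eta=0.1, gamma=1.0)` selects
the middle neuron as winner and moves it by `eta * (v - w)`, while its
neighbors move by a smaller, Gaussian-weighted amount.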

1.2 The Winner Relaxing Kohonen Algorithm 

We now consider an energy function $V$ first proposed in (Ritter et al.
1992). If we have a discrete input space, the potential function for the
expectation value of the learning step is given by

$$ V(\{w\}) = \frac{1}{2} \sum_{rs} g^{\gamma}_{rs} \sum_{\mu \,|\, v^{\mu} \in F_s(\{w\})} p(v^{\mu}) \cdot |v^{\mu} - w_r|^2, \tag{3} $$

where $F_s(\{w\})$ is the cell of the Voronoi tessellation (or Dirichlet
tessellation) of input space defined by (1). For discrete input space,
where $p(v)$ is a sum over delta peaks $\delta(v - v^{\mu})$, the first
derivative w.r.t. $w_r$ is not continuous at all weight vectors where
the borders of the Voronoi tessellation are shifting over one of the
input vectors (Fig. 1). Using (3) as a potential therefore requires the
assumption that none of the borders of the Voronoi tessellation is
shifting over a pattern vector $v^{\mu}$, which may be fulfilled in the
final convergence phase for discrete input spaces, but becomes
problematic if there are more receptor positions than neurons. If $p(v)$
is continuous, the sum over $\mu$ becomes an integral, and with every
stimulus vector update the surrounding Voronoi borders slide over
stimuli (which means they become represented by another weight vector),
so that there is no global energy function for the general case.
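
For a finite pattern set the potential can be evaluated directly. The
following is a sketch only, assuming the form of (3) with prefactor 1/2
and a Gaussian neighborhood, scalar patterns, and a one-dimensional
layer; `som_energy` is a name introduced here for illustration.

```python
import math


def som_energy(weights, patterns, probs, gamma):
    """V({w}) of Eq. (3) for scalar weights and a finite list of patterns."""
    n = len(weights)
    energy = 0.0
    for v, p in zip(patterns, probs):
        # Voronoi cell membership: each pattern belongs to its winner s.
        s = min(range(n), key=lambda r: abs(weights[r] - v))
        for r in range(n):
            g = math.exp(-((r - s) ** 2) / (2.0 * gamma ** 2))
            energy += 0.5 * g * p * (v - weights[r]) ** 2
    return energy
```

Because the winner `s` of a pattern changes abruptly when a Voronoi
border passes over it, this function is only piecewise smooth in the
weights, which is the difficulty described above.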

We remark that replacing the crisp (or hard) winner selection by a
soft winner $s = \arg\min_r \sum_{r'} g^{\gamma}_{rr'} |w_{r'} - v|^2$
makes the learning rule minimize (3) even in the continuous case
(Graepel et al. 1997, Heskes 1999). This is a formally elegant approach
if one wants to ensure the existence of an energy function and accepts
a modified winner selection.

Figure 1: Shift of Voronoi borders as an effect of weight vector update.

To motivate the Winner Relaxing learning, however, we return to the
hard winner selection scheme (1) and take up the learning rule given by
Kohonen (1991). Our use of this ansatz is justified here only a
posteriori by its use for adjusting the magnification.

From the shift of the borders of the Voronoi tessellation $F_s(\{w\})$
(see Fig. 1) in the evaluation of the gradient with respect to a weight
vector $w_r$, Kohonen (1991) derived for the (approximated) gradient
descent in $V$ the additive term
$-\frac{1}{2} \eta\, \delta_{rs} \sum_{r' \neq s} g^{\gamma}_{r's} (v - w_{r'})$
extending (2) for the winning neuron. As it implied an additional
elastic relaxation, it was straightforward to call it the 'Winner
Relaxing' (WR) Kohonen algorithm (Claussen 1992). In the remainder we
study the (generalized) Winner Relaxing Kohonen algorithm, or Winner
Relaxing Self-Organizing Map (WRSOM), first introduced in (Claussen
1992), in the form

$$ \delta w_r = \eta \Big\{ (v - w_r)\, g^{\gamma}_{rs} - \lambda\, \delta_{rs} \sum_{r' \neq s} g^{\gamma}_{r's} (v - w_{r'}) \Big\}, \tag{4} $$

where $s$ is the center of excitation for the incoming stimulus $v$, and
$g^{\gamma}_{rs}$ is a Gaussian function of distance in the neural layer
with characteristic length $\gamma$. Here $\lambda$ is a free parameter
of the algorithm. The original algorithm (associated with the potential
function) proposed by Kohonen in 1991 is obtained for $\lambda = +1/2$,
whereas the classical Self Organizing Map algorithm is obtained for
$\lambda = 0$. The influence of $\lambda$ on the magnification behaviour
is the central issue of this paper.
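
One learning step of the rule (4) might be written as follows. This is a
hedged sketch under simplifying assumptions (scalar weights, a
one-dimensional layer, Gaussian neighborhood of width `gamma`); the name
`wrsom_step` and its signature are illustrative, not taken from the
original implementation.

```python
import math


def wrsom_step(weights, v, eta, gamma, lam):
    """One Winner Relaxing update, Eq. (4), for scalar weights in a 1D layer."""
    n = len(weights)
    s = min(range(n), key=lambda r: abs(weights[r] - v))  # winner, Eq. (1)
    g = [math.exp(-((r - s) ** 2) / (2.0 * gamma ** 2)) for r in range(n)]
    # Relaxing term: neighborhood-weighted sum over all neurons except the winner.
    relax = sum(g[r] * (v - weights[r]) for r in range(n) if r != s)
    new_weights = []
    for r in range(n):
        delta = g[r] * (v - weights[r])
        if r == s:
            delta -= lam * relax  # only the winner receives the extra term
        new_weights.append(weights[r] + eta * delta)
    return new_weights
```

Setting `lam=0` reproduces the classical SOM step, and `lam=0.5`
corresponds to the variant of Kohonen (1991). Only the winner's update
differs between the two, so the extra cost per step is a single weighted
sum.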

1.3 The Magnification Factor



The magnification factor is defined as the density of neurons $r$ (i.e.
the density of synaptic weight vectors $w_r$) per unit volume of input
space, and is therefore given by the inverse Jacobi determinant of the
mapping from input space to neuron layer:
$M = |J|^{-1} = |\det(dw/dr)|^{-1}$. We assume the input space to be
continuous and of the same dimension as the neural layer, and the map to
be noninverting ($J > 0$).

The magnification factor quantifies the network's response to a given
probability density of stimuli $P(v)$. To evaluate $M$ in higher
dimensions, one in general has to compute the equilibrium state of the
whole network and therefore needs complete global knowledge of $P(v)$,
except for separable cases. For one-dimensional mappings the
magnification factor can follow a universal magnification law, that is,
$M(w(r))$ is a function of the local probability density $P$ only,
independent of both the location $r$ in the neural layer and the
location $w(r)$ in input space. It is nontrivial whether a power law
exists or not; the Elastic Net obeys a universal magnification law that
remarkably is not a power law (Claussen & Schuster 2002), due to a
nonvanishing elastic tension in regions of small input density. For the
classical Kohonen algorithm the magnification law is a power law
$M(w(r)) \propto P(w(r))^{\rho}$ with exponent $\rho = 2/3$ (Ritter &
Schulten 1986). See Table 1 for an overview. For a discrete neural layer
and different neighborhood kernels, corrections apply (Ritter 1991,
Ritter et al. 1992, Dersch & Tavan 1995).

Table 1: Magnification laws for one-dimensional maps



As the brain is assumed to be optimized by evolution for information
processing, one could conjecture that maximal mutual information can
define an extremal principle governing the setup of neural structures.
For feedforward neural structures with lateral inhibition, an algorithm
of maximal mutual information has been defined by Linsker (1989) using
the gradient descent in mutual information. It requires computationally
costly integrations and has a highly nonlocal learning rule; therefore
it is neither favourable as a model for biological maps nor feasible for
technical applications. Due to realization constraints, both technical
applications and cortical networks (Plumbley 1999) are not necessarily
capable of reaching this optimum. Even if one had experimental data on
the magnification behaviour, the question of which self-organizing
dynamics neural structures emerge from would remain. Overall it is
desirable to find learning rules that maximize mutual information in a
simpler way.

An optimal map from the view of information theory would reproduce the
input probability exactly ($M \sim P(v)^{\rho}$ with $\rho = 1$), which
is equivalent to the condition that all neurons in the layer fire with
the same probability. This defines an equiprobabilistic mapping (van
Hulle 2000). An exponent $\rho = 0$, on the other hand, corresponds to a
uniform distribution of weight vectors, or no adaptation at all. So the
magnification exponent is a direct indicator of how far a Self
Organizing Map algorithm is from the optimum predicted by information
theory.



2 Magnification Exponent of the Winner-Relaxing Kohonen Algorithm

We now derive the magnification law of the Winner-Relaxing Kohonen
algorithm (4) for the case of a 1D→1D map. Note that for higher
dimensions analytical results can only be obtained for special
degenerate cases of the input probability density and therefore lack
generality.

The necessary condition for the final state of the algorithm is that the
expectation value of the learning step vanishes for all neurons $r$:

$$ \forall\, r \in R: \quad 0 = \int dv\; p(v)\, \delta w_r(v). \tag{5} $$

Since this expectation value is equal to the learning step of the
pattern-parallel rule, (5) is the stationary state condition for serial,
parallel, and batch updating alike, so we can treat these variants
simultaneously. (As synaptic plasticity is widely assumed to be based on
integrative effects, one could claim that a parallel model is
sufficient.) The update rule (4) can be extended by an additional
diagonal term controlled by $\mu$ (footnote 1):

$$ \delta w_r = \eta \Big\{ (v - w_r) \cdot g^{\gamma}_{rs} + \mu\, (v - w_r)\, \delta_{rs} - \lambda\, \delta_{rs} \sum_{r' \neq s} g^{\gamma}_{r's} (v - w_{r'}) \Big\}. \tag{6} $$

1 Whereas the extra term controlled by the parameter $\mu$ was
introduced in (Claussen 1992) for pure generality and is kept within the
derivation, it does not contribute to the magnification. In general, the
setting $\mu = 0$ is recommended (and probably most stable), so the
Winner Relaxing Kohonen algorithm has only one relevant control
parameter, $\lambda$.
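
The relation between the stationarity condition (5) and batch updating
can be illustrated by averaging the bracketed term of (6) over a finite
sample of stimuli. The sketch below assumes scalar weights and a
one-dimensional layer; `extended_delta` and `batch_step` are
hypothetical helper names, and the sample mean is only a finite estimate
of the integral in (5).

```python
import math


def extended_delta(weights, v, gamma, lam, mu):
    """Bracketed term of Eq. (6) for one stimulus v (scalar weights, 1D layer)."""
    n = len(weights)
    s = min(range(n), key=lambda r: abs(weights[r] - v))
    g = [math.exp(-((r - s) ** 2) / (2.0 * gamma ** 2)) for r in range(n)]
    relax = sum(g[r] * (v - weights[r]) for r in range(n) if r != s)
    delta = [g[r] * (v - weights[r]) for r in range(n)]
    delta[s] += mu * (v - weights[s]) - lam * relax  # diagonal and relaxing terms
    return delta


def batch_step(weights, stimuli, eta, gamma, lam, mu=0.0):
    """Sample average of Eq. (6) over a batch: a finite estimate of Eq. (5)."""
    n = len(weights)
    mean = [0.0] * n
    for v in stimuli:
        for r, d in enumerate(extended_delta(weights, v, gamma, lam, mu)):
            mean[r] += d / len(stimuli)
    return [w + eta * m for w, m in zip(weights, mean)]
```

A configuration is stationary in the sense of (5) when `batch_step`
leaves the weights unchanged for a sufficiently large sample; a single
neuron at 0 with stimuli -1 and +1 is the simplest such case.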
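By insertion of the update rule (6) one obtains

$$
\begin{aligned}
0 &= \int ds\; P(w(s))\, J(s)\, g^{\gamma}_{rs}\, (w(s) - w(r)) \\
  &\quad + (\mu + \lambda) \cdot \int ds\; P(w(s))\, J(s)\, \delta_{rs}\, (w(s) - w(r)) \\
  &\quad - \lambda \cdot \iint ds\, dr'\; P(w(s))\, J(s)\, \delta_{rs}\, g^{\gamma}_{r's}\, (w(s) - w(r')) \\
  &= \int ds\; P(w(s))\, J(s)\, g^{\gamma}_{rs}\, (w(s) - w(r)) \\
  &\quad + \lambda \cdot P(w(r))\, J(r) \cdot \int dr'\; g^{\gamma}_{rr'}\, (w(r') - w(r)).
\end{aligned} \tag{7}
$$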



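The derivation can be performed analogously to (Ritter, Martinetz &
Schulten 1992). In the continuum limit there is always an exactly
matching winning weight vector $w_s = v$. Further, the integration
variable is substituted, $dv = dw_s = J(s)\, ds$, and we define the
abbreviation $P := P(w(r))$. In the first integrand, $PJ$ has to be
expanded in powers of $q := s - r$. Within the second integral, $PJ$ is
evaluated only at $r$. Thus the integration yields, in leading order in
$q$,

$$
\begin{aligned}
0 &= \gamma^2 \left( \frac{d(PJ)}{dr} \frac{dw}{dr} + \frac{1}{2} PJ \frac{d^2 w}{dr^2} + \frac{\lambda}{2} PJ \frac{d^2 w}{dr^2} \right) \\
  &= \gamma^2 J \left( J \frac{dP}{dr} + \Big( 1 + \frac{1}{2} + \frac{\lambda}{2} \Big) P \frac{dJ}{dr} \right),
\end{aligned} \tag{8}
$$

where the term proportional to $\lambda$ is the Winner Relaxing
contribution and $dw/dr = J$ has been used.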

Further we have to require $\gamma \neq 0$, $P \neq 0$, $dP/dr \neq 0$.
Then the ansatz of a universal local magnification law $J(r) = J(P(r))$,
i.e. $J$ depends only on the local value of $P$, which may be expected
for the one-dimensional case only, requires $J$ to fulfill the
differential equation

$$ 0 = \frac{J}{P} + \Big( 1 + \frac{1}{2} + \frac{\lambda}{2} \Big) \frac{dJ}{dP} \tag{9} $$

or

$$ \frac{dJ}{dP} = -\frac{2}{3 + \lambda} \frac{J}{P}. \tag{10} $$

It has a power law solution (provided that $\lambda \neq -3$), which
verifies the ansatz made above, $J$ being a function of the local
density only:

$$ M = \frac{1}{J} \propto P(v)^{\frac{2}{3 + \lambda}}. \tag{11} $$
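
The step from (10) to (11) is a separation of variables. Written out,
with the integration constant absorbed into the proportionality, it
reads:

```latex
\frac{dJ}{J} = -\frac{2}{3+\lambda}\,\frac{dP}{P}
\;\Longrightarrow\;
\ln J = -\frac{2}{3+\lambda}\,\ln P + \mathrm{const}
\;\Longrightarrow\;
J \propto P^{-2/(3+\lambda)},
\qquad
M = \frac{1}{J} \propto P^{\,2/(3+\lambda)} .
```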
Thus the magnification exponent is given by $\rho = 2/(3 + \lambda)$
and can be tuned from 1/2 to 1 (see Fig. 2) within the range of
stability.
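
As a quick arithmetic check of the exponent $2/(3 + \lambda)$ for the
parameter values discussed in the text, exact fractions can be used. The
helper name `magnification_exponent` is introduced here purely for
illustration.

```python
from fractions import Fraction


def magnification_exponent(lam):
    """Exponent 2 / (3 + lambda) of the 1D magnification law, Eq. (11)."""
    lam = Fraction(lam)
    if lam == -3:
        raise ValueError("the power-law solution requires lambda != -3")
    return Fraction(2) / (3 + lam)


for lam in (Fraction(1), Fraction(1, 2), Fraction(0), Fraction(-1)):
    print(lam, magnification_exponent(lam))
```

This prints the exponents 1/2, 4/7, 2/3 and 1 for $\lambda = 1$, $1/2$,
$0$ and $-1$, respectively.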




Figure 2: Impact of the parameter $\lambda$ on the magnification
exponent. The cases of $\lambda = 1/2$ (Kohonen 1991), the SOM case
$\lambda = 0$ (Kohonen 1982) and the "winner enhancing" choice
$\lambda = -1$ are marked with dots.

For the $\lambda = 1/2$ choice of the Winner-Relaxing Kohonen Algorithm
the magnification factor follows an exact power law with magnification
exponent $\rho = 4/7$, which is smaller than $\rho = 2/3$ for the
classical Self Organizing Feature Map (Ritter & Schulten 1986), but
still much larger than $\rho = 1/3$ for Vector Quantization and Neural
Gas. In any case, the maps resulting from the choices $\lambda = 1/2$
and $\lambda = 0$ are not optimal in terms of information theory.



3 Enhancing the Magnification 

From this result one would try to invert the relaxing effect by choosing
negative values for $\lambda$, which means to "enforce" the winner. In
fact, the choice $\lambda = -1$ leads to the magnification exponent 1.

The magnification law (11) is verified numerically, as shown in Table 2.
Apart from the fact that the exponent can be varied between 1/2 and 1 by
a priori parameter choice, the simulations show that the Winner Relaxing
Algorithm is able to establish information-theoretically optimal
self-organizing maps in the "winner enforcing" case ($\lambda < 0$).
Table 2: Magnification exponent of the Winner Relaxing Algorithms
determined numerically from a sample setup with 200 neurons,
$2 \cdot 10^7$ update steps and a learning rate of 0.1. The input space
was the unit interval; the stimulus probability density was chosen
exponentially as $\exp(-\beta w)$ with $\beta = 4$. After an adaptation
period of $5 \cdot 10^7$ learning steps, a further 10% of learning steps
were used to calculate the average slope of $\log J$ as a function of
$\log P$ and its fluctuation. (The first and last 10% of neurons were
excluded to eliminate boundary effects.) The small numbers denote the
fluctuation of the exponent through the final 10% of the experiment. For
small $\gamma$, the neighborhood interaction becomes too weak. If the
Gaussian neighborhood extends over some neurons ($\gamma = 5$), the
exponent follows the predicted dependence on $\lambda$ given by
$2/(3 + \lambda)$. For $|\lambda| > 1$ the system is unstable; this is
the case where the additional update term of the winner is larger than
the sum over all other update terms in the whole network. Tuning of the
parameter $\mu$ did not seem to extend the region of stability. As the
relaxing effect is inverted for $\lambda < 0$, fluctuations are larger
than in the Kohonen case.
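
The slope evaluation described in the caption of Table 2 can be sketched
as a least-squares fit of $\log M = -\log J$ against $\log P$ over the
interior neurons. The code below is an illustration under stated
assumptions, not the original evaluation script: $J$ is approximated by
a central difference, `estimate_exponent` and `ideal_weights` are
hypothetical names, and `ideal_weights` merely generates a synthetic
chain that obeys a given power law exactly so that the estimator can be
checked.

```python
import math


def estimate_exponent(weights, density, trim=0.1):
    """Least-squares slope of log M = -log J against log P, interior neurons only."""
    n = len(weights)
    lo, hi = max(1, int(trim * n)), min(n - 1, n - int(trim * n))
    xs, ys = [], []
    for r in range(lo, hi):
        jac = 0.5 * (weights[r + 1] - weights[r - 1])  # J = dw/dr, central difference
        xs.append(math.log(density(weights[r])))
        ys.append(-math.log(jac))                      # M = 1/J
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def ideal_weights(n, rho, beta):
    """Synthetic weights on [0, 1] with dw/dr proportional to exp(rho * beta * w)."""
    a = (1.0 - math.exp(-rho * beta)) / (n - 1)
    return [-math.log(1.0 - a * r) / (rho * beta) for r in range(n)]
```

For a synthetic chain such as `ideal_weights(200, 2/3, 4.0)` together
with the density `lambda w: math.exp(-4.0 * w)`, the estimator should
return a value close to 2/3; applied to weights from an actual
simulation run it gives the kind of slope reported in the table.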

4 Ordering time and Stability Region 

At least for the 2D→2D case, the Winner-Relaxing Kohonen Algorithm was
reported as 'somewhat faster' (Kohonen 1991) in the initial ordering
process. In a 1D→1D sample setup (Claussen 2003), a marginally quicker
ordering was observed for negative $\lambda$, at least at a relatively
high learning rate $\eta = 1$. As many parameters and the input
distribution itself influence the ordering time and the decay of
fluctuations, different results may be obtained; e.g., a small fraction
of input distributions containing topological kinks take much longer to
become ordered, so that minimal, maximal, averaged, and inverse averaged
ordering times will deviate.

If one instead investigates the time dependence of the fluctuations, a
considerably quicker decay is observed for positive values of $\lambda$
(Fig. 3), consistent with the observation by Kohonen (1991) mentioned
above.
Figure 3: Time dependence (every tenth iterate shown) of the log rms
fluctuations for different $\lambda$, plotted against the number of
iterations. The same setup of a single run with $\gamma = 1.0$,
$\eta = 0.1$ and 10 neurons is used throughout; each run starts with the
same configuration and random initial values between 0 and 1. For
$\lambda > 0$ a quicker ordering is observed.




Figure 4: Fast learning using a simple switching strategy (horizontal
axis: number of iterations). Starting with $\lambda = 1/2$, ordering is
achieved quickly. At iteration step 2000, $\lambda$ is immediately
changed to $-1$ (dotted). This speeds up the learning phase by two
orders of magnitude compared to starting with $\lambda = -1$, and by a
factor of 4 compared to $\lambda = 0$ (dashed, shown for comparison). If
the duration of the initial ordering phase is underestimated, a long
learning phase results again (solid line; switch at step 200).
These simulations indicate that for obtaining optimal magnification, the
price of a longer learning phase may have to be paid. However, this
drawback can be circumvented by combining the advantages of both
$\lambda$ ranges, i.e. using $\lambda > 0$ in the initial phase to speed
up ordering, and switching to $\lambda = -1$ after a considerable decay
of fluctuations (Fig. 4). No complicated time dependence of this
parameter switch has been used, and neither learning rate nor
neighborhood has been changed during the simulation.
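
The switching strategy amounts to a piecewise-constant schedule for
$\lambda$. The following sketch combines such a schedule with the update
rule (4); the uniform stimulus distribution, the random seed, the
default switch step and the names `lambda_schedule` and `train` are
assumptions made here for illustration and are not taken from the
simulations reported above.

```python
import math
import random


def lambda_schedule(step, switch_step=2000, lam_order=0.5, lam_final=-1.0):
    """Piecewise-constant lambda: relaxing while ordering, enforcing afterwards."""
    return lam_order if step < switch_step else lam_final


def train(n_neurons=10, steps=4000, eta=0.1, gamma=1.0, seed=0):
    """Run Eq. (4) on uniform stimuli from [0, 1] with the switched lambda."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_neurons)]
    for step in range(steps):
        lam = lambda_schedule(step)
        v = rng.random()
        s = min(range(n_neurons), key=lambda r: abs(w[r] - v))
        g = [math.exp(-((r - s) ** 2) / (2.0 * gamma ** 2)) for r in range(n_neurons)]
        relax = sum(g[r] * (v - w[r]) for r in range(n_neurons) if r != s)
        delta = [g[r] * (v - w[r]) for r in range(n_neurons)]
        delta[s] -= lam * relax  # winner-only term of Eq. (4)
        w = [wr + eta * d for wr, d in zip(w, delta)]
    return w
```

Learning rate and neighborhood width stay fixed throughout, as in the
text; only `lam` changes, and it changes once.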

The last important issue to be addressed is the dependence of stability
on the parameter $\lambda$, especially at the border $-1$. Fortunately,
the algorithm appears to be stable (in the 1D→1D case) in the whole
range $-1 \le \lambda \le +1$, as shown in Fig. 5. On both borders the
Winner Relaxing learning remains stable. Thus, the full range of
magnification exponents between 1/2 and 1 can be achieved.




Figure 5: For $\mu \in [-1, +1]$ the common stability range is
$\lambda \in [-1, +1]$. For $\lambda < -1$, the log rms of the weight
vector differences $w_r - w_{r-1}$ diverges, but extremely long quiet
transients are observed there. In the upper range $\lambda > +1$, making
use of the diagonal term by using $\mu \neq 0$ extends the stability
range. The plots correspond to 10 (solid), $10^6$ (dash-dotted), and
$10^5$ (dashed) iterations, respectively, for $\mu = 0$. For 10
iterations, also the cases $\mu = -1$ (thin dots) and $\mu = +1$ (thick
dots) are shown. Parameters are $\gamma = 1.0$, $\eta = 0.1$, and 10
neurons are initialized near an equidistant chain with noise of
amplitude 0.01 added.
In higher dimensions no universal magnification law is expected, but one
can evaluate the output entropy for a given input distribution and
network. As shown in Fig. 6, the enhancement of output entropy by Winner
Relaxing learning is also effective in the two-dimensional case, where,
however, parameters have to be chosen more carefully.




Figure 6: Entropy enhancement for the 2D→2D case for network geometries
of 10 × 10 and 5 × 20 neurons. The data density was
$\sin(\pi v_1) \cdot \sin(\pi v_2)$ within the unit square,
$\gamma = 5.0$, and $\eta$ was decreased from 0.01 to 0.001 during
$10^6$ learning steps. Alternatively, batch learning (over 100 steps)
has been used; here $\eta$ was decreased from 0.05 to 0.001 with
$\gamma = 2.0$, and in the first $2 \cdot 10^5$ steps ordinary SOM
learning was applied ($\gamma = 5.0$, $\lambda = 0$). In all cases, for
$\lambda = -1$ the entropy is enlarged compared to the unmodified case
$\lambda = 0$, and close to the optimum ($\ln 100 = 4.605$).
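
The output entropy referred to here can be computed from the winner
frequencies of a trained map. A minimal sketch, assuming the entropy is
the Shannon entropy in nats of the empirical winning probabilities
(`output_entropy` is an illustrative name):

```python
import math


def output_entropy(win_counts):
    """Shannon entropy (in nats) of the winner frequencies of a trained map."""
    total = sum(win_counts)
    return -sum((c / total) * math.log(c / total) for c in win_counts if c > 0)
```

For 100 neurons that all win equally often, the value equals
$\ln 100 \approx 4.605$, the optimum quoted in the caption; any unequal
usage of the neurons gives a smaller value.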



5 Discussion 

After our first study (Claussen 1992), Herrmann et al. (1995) introduced
another modification of the learning process, which was also applied to
the Neural Gas algorithm (Villmann & Herrmann 1998). Their central idea
is to use a learning rate $\eta$ that depends locally on the input
probability density, so that an exponent of 1 can also be obtained. As
the input probability density should not be available to a neural map
that self-organizes from stimuli drawn from that distribution, it is
estimated from the actual local reconstruction mismatch (being an
estimate for the size of the Voronoi cell) and from the time elapsed
since the neuron was last the winner. Both operations require additional
memory and computation, and, due to the estimating character, the
learning rate has to be bounded in practical use. This localized
learning was overall easier to apply and overcame the stability problems
of the early approach of conscience learning (deSieno 1988).

Another systematic method, the extended Maximum Entropy Learning Rule,
has been introduced by van Hulle (1997). It approximates a map of
maximal output entropy for arbitrary dimension, although in higher
dimensions the handling of the quantization regions becomes less
practical (van Hulle 1998). A quite different approach that is also
capable of generating equiprobabilistic maps is kernel optimization (van
Hulle 1998, 2000, 2002), i.e. the neighborhood kernel radii themselves
become learning parameters, in addition to the weight vectors defining
the kernel centers. Other approaches, also influencing magnification,
consider the selection of the winner to be probabilistic, leading to
elegant statistical approaches to potential functions, as given by
Graepel et al. (1997) and Heskes (1999).

As shown recently (Claussen & Villmann 2004), the Winner Relaxing 
concept can also be transferred successfully to the Neural Gas, confirming 
the utility of this class of learning rules. 

6 Conclusions 

The Linsker, Elastic Net and Winner-Relaxing Kohonen algorithms can be
derived from an extremal principle, given by information theory,
physical motivations, and reconstruction error, respectively. In this
paper we have chosen the magnification law to indicate how closely the
algorithm approaches the adaptation properties of a map of maximal
mutual information. The magnification law is one quantitative property
that is both accessible by neurobiological experiments and manifest as a
quantitative control parameter of a neural map used as vector quantizer
in applications. A map of maximal mutual information uses all neurons
with the same probability, i.e. their firing rates will be equal.

In this work we have investigated the Winner Relaxing approach to
establish a new family of vector quantizers. The shift from the Kohonen
algorithm ($\rho = 2/3$) to the Winner Relaxing Kohonen algorithm
($\rho = 4/7$) seems to be marginal if the emphasis is laid on the
existence of a potential function. If a large magnification exponent is
desired, the Winner Relaxing Kohonen Algorithm (with $\lambda = -1$)
combines simple computation with a magnification corresponding to
maximal mutual information.